Simple Linear Regression
Simple linear regression is univariate: it has one independent variable (x) and one dependent variable (y). It assumes the data were generated by a linear relationship.

Notation: y = observed value, y(hat) = estimated value, y(bar) = average of y, e = errors, n = sample size.

Estimation methods (each minimises one of the quantities below)
* Residuals
* Least Squared Errors ~> This is the most commonly used estimation method and has the best properties (when the assumptions listed further down hold, it is the best linear unbiased estimator)
* Minimum Absolute Deviation
* Cramer = minimum Chi^2
* MAPE = Mean Absolute Percentage Error

Assessment of Model
* Check goodness of fit
** R^2 and Adjusted R^2 ~> the measure of fit
*** Tells us how well the model fits the data
*** Value from 0 to 1. The closer to 1, the better.
*** R^2 = SSreg/SStot = 1 - SSres/SStot
*** As a rough rule of thumb, 0.7 is considered good for scientific data, while 0.6 is considered good for the social sciences
*** Adjusted R^2 corrects R^2 for the number of explanatory variables, because adding new variables to the model can increase R^2 spuriously
** Standard Error
** F statistic p value
** t statistic p value
* Check regression assumptions

Standard Error = sample estimate of the standard deviation of the residuals

ANOVA = Analysis of Variance
* df = degrees of freedom = depends on the sample size and the number of parameters to be estimated
* df(tot) = n - 1, df(reg) = k, df(res) = n - k - 1, where k is the number of explanatory variables and n is the sample size
* F = Regression MS / Residual MS = this tells us the ratio of the explained to the unexplained variance

Assumptions of linear regression
* Linear relationship
* Homoskedasticity: the errors have constant variance - C
* Errors are independent of one another and uncorrelated with x - I
* Errors have an expected value of 0 - E
* Errors (and therefore the response variable y, given x) are Normally distributed - N
* Mnemonic: NICE
* If assumptions are wrong

How to analyse residuals
* Predicted vs Residual plot - check linearity and constant variance
** Can be used to check if E(error) = 0
** E(error) can be checked numerically too
** If the mean of the residuals is not 0 then there is a mistake (least squares with an intercept forces it to 0)
* Histogram of residuals - check normality/skew of errors
* Q-Q or P-P plot - check normality/skew of errors
** Compare the plotted residuals to the histogram of standardised residuals
** Bow shaped = skew
** S shaped = kurtosis, too peaked or too flat
** https://en.wikipedia.org/wiki/P%E2%80%93P_plot
* Index plot - check independence of errors
** If the residuals are correlated with the index (the order of the observations), the independence assumption does not hold

Prediction
* Two types of prediction: in sample and out of sample.
* In sample is predicting missing values within the range of the sample.
* Out of sample is predicting values that go beyond the range of the sample.
** Extrapolation is risky because we only know about the range of the sample we collected
** If we don't know the actual value of y, we do not know the prediction error pe, but we can still estimate the standard error of the prediction

Outliers
* Anomalous results. Those that strongly affect the fitted results can be called influential points
* How to detect
** Graphs, frequency tables, the range of a continuous variable, scatterplots, box-plots, a plot of standardised residuals against standardised fitted values
* What to do?
** Try to find out where they came from. Usually they are coding errors.
** If they are errors, drop them.
** If they are not errors but influence the results greatly, drop them.
** Any dropped point has to be justified
** Run a sensitivity analysis
*** Run two analyses, one with the outlier and one without, to see how strongly the outlier affects the results
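
The sensitivity analysis described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a prescribed method: the helper name `fit_simple_ols` is hypothetical, the data are made up (roughly y = 2x, with the last point playing the suspected outlier), and the fit uses the usual least squares formulas for one explanatory variable (k = 1). It also reports R^2, the standard error and the F statistic from the notes, plus the mean residual as a numerical check that E(error) = 0.

```python
from math import sqrt


def fit_simple_ols(x, y):
    """Least squares fit of y = b0 + b1 * x, with basic goodness-of-fit numbers."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Slope and intercept that minimise the sum of squared errors.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]

    ss_res = sum(e ** 2 for e in residuals)       # unexplained variation
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)   # total variation
    ss_reg = ss_tot - ss_res                      # explained variation

    k = 1               # one explanatory variable
    df_res = n - k - 1  # residual degrees of freedom

    return {
        "intercept": b0,
        "slope": b1,
        "r2": 1 - ss_res / ss_tot,
        "std_error": sqrt(ss_res / df_res),
        "f_stat": (ss_reg / k) / (ss_res / df_res),
        "mean_residual": sum(residuals) / n,
    }


# Made-up data: the last observation is the suspected outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 30.0]

with_outlier = fit_simple_ols(x, y)
without_outlier = fit_simple_ols(x[:-1], y[:-1])

for label, fit in (("with outlier", with_outlier),
                   ("without outlier", without_outlier)):
    print(f"{label}: slope={fit['slope']:.3f}, "
          f"intercept={fit['intercept']:.3f}, R^2={fit['r2']:.3f}, "
          f"SE={fit['std_error']:.3f}, F={fit['f_stat']:.1f}")
```

If the two fits differ a lot in slope or R^2, the point is influential, and whichever choice is made (keeping or dropping it) needs to be argued for. In practice a statistics package would normally be used in place of the hand-written formulas.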